{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "307804a3-c02b-4a57-ac0d-172c30ddc851",
   "metadata": {},
   "source": [
    "# Chroma\n",
    "\n",
    ">[Chroma](https://docs.trychroma.com/getting-started) is a AI-native open-source vector database focused on developer productivity and happiness. Chroma is licensed under Apache 2.0.\n",
    "\n",
    "<a href=\"https://discord.gg/MMeYNTmh3x\" target=\"_blank\">\n",
    "      <img src=\"https://img.shields.io/discord/1073293645303795742\" alt=\"Discord\">\n",
    "  </a>&nbsp;&nbsp;\n",
    "  <a href=\"https://github.com/chroma-core/chroma/blob/master/LICENSE\" target=\"_blank\">\n",
    "      <img src=\"https://img.shields.io/static/v1?label=license&message=Apache 2.0&color=white\" alt=\"License\">\n",
    "  </a>&nbsp;&nbsp;\n",
    "  <img src=\"https://github.com/chroma-core/chroma/actions/workflows/chroma-integration-test.yml/badge.svg?branch=main\" alt=\"Integration Tests\">\n",
    "\n",
    "- [Website](https://www.trychroma.com/)\n",
    "- [Documentation](https://docs.trychroma.com/)\n",
    "- [Twitter](https://twitter.com/trychroma)\n",
    "- [Discord](https://discord.gg/MMeYNTmh3x)\n",
    "\n",
    "Chroma is fully-typed, fully-tested and fully-documented.\n",
    "\n",
    "Install Chroma with:\n",
    "\n",
    "```sh\n",
    "pip install chromadb\n",
    "```\n",
    "\n",
    "Chroma runs in various modes. See below for examples of each integrated with LangChain.\n",
    "- `in-memory` - in a python script or jupyter notebook\n",
    "- `in-memory with persistance` - in a script or notebook and save/load to disk\n",
    "- `in a docker container` - as a server running your local machine or in the cloud\n",
    "\n",
    "Like any other database, you can: \n",
    "- `.add` \n",
    "- `.get` \n",
    "- `.update`\n",
    "- `.upsert`\n",
    "- `.delete`\n",
    "- `.peek`\n",
    "- and `.query` runs the similarity search.\n",
    "\n",
    "View full docs at [docs](https://docs.trychroma.com/reference/Collection). "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "b5331b6b",
   "metadata": {},
   "source": [
    "## Basic Example\n",
    "\n",
    "In this basic example, we take the a Paul Graham essay, split it into chunks, embed it using an open-source embedding model, load it into Chroma, and then query it."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "f7010b1d-d1bb-4f08-9309-a328bb4ea396",
   "metadata": {},
   "source": [
    "#### Creating a Chroma Index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b3df0b97",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install llama-index\n",
    "!pip install langchain\n",
    "!pip install chromadb"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "d48af8e1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# import\n",
    "from llama_index import VectorStoreIndex, SimpleDirectoryReader\n",
    "from llama_index.vector_stores import ChromaVectorStore\n",
    "from llama_index.storage.storage_context import StorageContext\n",
    "from langchain.embeddings.huggingface import HuggingFaceEmbeddings\n",
    "from llama_index.embeddings import LangchainEmbedding\n",
    "from IPython.display import Markdown, display\n",
    "import chromadb"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "374a148b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# set up OpenAI\n",
    "import os\n",
    "import getpass\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "667f3cb3-ce18-48d5-b9aa-bfc1a1f0f0f6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO:chromadb.telemetry.posthog:Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.\n",
      "Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.\n",
      "Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.\n",
      "INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2\n",
      "Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2\n",
      "Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2\n",
      "INFO:sentence_transformers.SentenceTransformer:Use pytorch device: cpu\n",
      "Use pytorch device: cpu\n",
      "Use pytorch device: cpu\n",
      "INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens\n",
      "> [build_index_from_nodes] Total LLM token usage: 0 tokens\n",
      "> [build_index_from_nodes] Total LLM token usage: 0 tokens\n",
      "INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 17038 tokens\n",
      "> [build_index_from_nodes] Total embedding token usage: 17038 tokens\n",
      "> [build_index_from_nodes] Total embedding token usage: 17038 tokens\n",
      "INFO:llama_index.token_counter.token_counter:> [retrieve] Total LLM token usage: 0 tokens\n",
      "> [retrieve] Total LLM token usage: 0 tokens\n",
      "> [retrieve] Total LLM token usage: 0 tokens\n",
      "INFO:llama_index.token_counter.token_counter:> [retrieve] Total embedding token usage: 8 tokens\n",
      "> [retrieve] Total embedding token usage: 8 tokens\n",
      "> [retrieve] Total embedding token usage: 8 tokens\n",
      "INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 1874 tokens\n",
      "> [get_response] Total LLM token usage: 1874 tokens\n",
      "> [get_response] Total LLM token usage: 1874 tokens\n",
      "INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens\n",
      "> [get_response] Total embedding token usage: 0 tokens\n",
      "> [get_response] Total embedding token usage: 0 tokens\n"
     ]
    },
    {
     "data": {
      "text/markdown": [
       "<b>\n",
       "The author grew up writing essays, learning Italian, exploring Florence, painting people, working with computers, studying at RISD, living in a rent-controlled apartment, building an online store builder, editing code, publishing essays online, writing essays, working on spam filters, cooking for groups, buying a building, and attending parties.</b>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# create client and a new collection\n",
    "chroma_client = chromadb.Client()\n",
    "chroma_collection = chroma_client.create_collection(\"quickstart\")\n",
    "\n",
    "# define embedding function\n",
    "embed_model = LangchainEmbedding(\n",
    "    HuggingFaceEmbeddings(model_name=\"sentence-transformers/all-mpnet-base-v2\")\n",
    ")\n",
    "\n",
    "# load documents\n",
    "documents = SimpleDirectoryReader(\n",
    "    \"../../../examples/paul_graham_essay/data\"\n",
    ").load_data()\n",
    "\n",
    "# set up ChromaVectorStore and load in data\n",
    "vector_store = ChromaVectorStore(chroma_collection=chroma_collection)\n",
    "storage_context = StorageContext.from_defaults(vector_store=vector_store)\n",
    "index = VectorStoreIndex.from_documents(\n",
    "    documents, storage_context=storage_context, embed_model=embed_model\n",
    ")\n",
    "\n",
    "# Query Data\n",
    "query_engine = index.as_query_engine(chroma_collection=chroma_collection)\n",
    "response = query_engine.query(\"What did the author do growing up?\")\n",
    "display(Markdown(f\"<b>{response}</b>\"))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "349de571",
   "metadata": {},
   "source": [
    "## Basic Example (including saving to disk)\n",
    "\n",
    "Extending the previous example, if you want to save to disk, simply initialize the Chroma client and pass the directory where you want the data to be saved to. \n",
    "\n",
    "`Caution`: Chroma makes a best-effort to automatically save data to disk, however multiple in-memory clients can stomp each other's work. As a best practice, only have one client per path running at any given time.\n",
    "\n",
    "`Protip`: Sometimes you can call `db.persist()` to force a save. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "id": "9c3a56a5",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO:chromadb.telemetry.posthog:Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.\n",
      "Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.\n",
      "Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.\n",
      "INFO:chromadb.db.duckdb:loaded in 20 embeddings\n",
      "loaded in 20 embeddings\n",
      "loaded in 20 embeddings\n",
      "INFO:chromadb.db.duckdb:loaded in 1 collections\n",
      "loaded in 1 collections\n",
      "loaded in 1 collections\n",
      "INFO:chromadb.db.duckdb:collection with name quickstart already exists, returning existing collection\n",
      "collection with name quickstart already exists, returning existing collection\n",
      "collection with name quickstart already exists, returning existing collection\n",
      "INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens\n",
      "> [build_index_from_nodes] Total LLM token usage: 0 tokens\n",
      "> [build_index_from_nodes] Total LLM token usage: 0 tokens\n",
      "INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 17038 tokens\n",
      "> [build_index_from_nodes] Total embedding token usage: 17038 tokens\n",
      "> [build_index_from_nodes] Total embedding token usage: 17038 tokens\n",
      "INFO:chromadb.db.duckdb:Persisting DB to disk, putting it in the save folder: ./chroma_db\n",
      "Persisting DB to disk, putting it in the save folder: ./chroma_db\n",
      "Persisting DB to disk, putting it in the save folder: ./chroma_db\n",
      "INFO:chromadb.telemetry.posthog:Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.\n",
      "Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.\n",
      "Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.\n",
      "INFO:chromadb.db.duckdb:loaded in 40 embeddings\n",
      "loaded in 40 embeddings\n",
      "loaded in 40 embeddings\n",
      "INFO:chromadb.db.duckdb:loaded in 1 collections\n",
      "loaded in 1 collections\n",
      "loaded in 1 collections\n",
      "INFO:chromadb.db.duckdb:collection with name quickstart already exists, returning existing collection\n",
      "collection with name quickstart already exists, returning existing collection\n",
      "collection with name quickstart already exists, returning existing collection\n",
      "INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens\n",
      "> [build_index_from_nodes] Total LLM token usage: 0 tokens\n",
      "> [build_index_from_nodes] Total LLM token usage: 0 tokens\n",
      "INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 0 tokens\n",
      "> [build_index_from_nodes] Total embedding token usage: 0 tokens\n",
      "> [build_index_from_nodes] Total embedding token usage: 0 tokens\n",
      "INFO:llama_index.token_counter.token_counter:> [retrieve] Total LLM token usage: 0 tokens\n",
      "> [retrieve] Total LLM token usage: 0 tokens\n",
      "> [retrieve] Total LLM token usage: 0 tokens\n",
      "INFO:llama_index.token_counter.token_counter:> [retrieve] Total embedding token usage: 8 tokens\n",
      "> [retrieve] Total embedding token usage: 8 tokens\n",
      "> [retrieve] Total embedding token usage: 8 tokens\n",
      "INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 1877 tokens\n",
      "> [get_response] Total LLM token usage: 1877 tokens\n",
      "> [get_response] Total LLM token usage: 1877 tokens\n",
      "INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens\n",
      "> [get_response] Total embedding token usage: 0 tokens\n",
      "> [get_response] Total embedding token usage: 0 tokens\n"
     ]
    },
    {
     "data": {
      "text/markdown": [
       "<b>\n",
       "The author grew up skipping a step in the evolution of computers, learning Italian, exploring Florence, painting people, working with technology companies, seeking signature styles at RISD, living in a rent-stabilized apartment, launching an online store builder, editing Lisp expressions, and publishing essays online.</b>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# save to disk\n",
    "from chromadb.config import Settings\n",
    "\n",
    "db = chromadb.Client(\n",
    "    Settings(chroma_db_impl=\"duckdb+parquet\", persist_directory=\"./chroma_db\")\n",
    ")\n",
    "chroma_collection = db.get_or_create_collection(\"quickstart\")\n",
    "vector_store = ChromaVectorStore(chroma_collection=chroma_collection)\n",
    "storage_context = StorageContext.from_defaults(vector_store=vector_store)\n",
    "index = VectorStoreIndex.from_documents(\n",
    "    documents, storage_context=storage_context, embed_model=embed_model\n",
    ")\n",
    "db.persist()\n",
    "\n",
    "# load from disk\n",
    "db2 = chromadb.Client(\n",
    "    Settings(chroma_db_impl=\"duckdb+parquet\", persist_directory=\"./chroma_db\")\n",
    ")\n",
    "chroma_collection = db2.get_or_create_collection(\"quickstart\")\n",
    "vector_store = ChromaVectorStore(chroma_collection=chroma_collection)\n",
    "storage_context = StorageContext.from_defaults(vector_store=vector_store)\n",
    "index = VectorStoreIndex.from_vector_store(\n",
    "    vector_store=vector_store, storage_context=storage_context, embed_model=embed_model\n",
    ")\n",
    "\n",
    "# Query Data from the persisted index\n",
    "query_engine = index.as_query_engine(chroma_collection=chroma_collection)\n",
    "response = query_engine.query(\"What did the author do growing up?\")\n",
    "display(Markdown(f\"<b>{response}</b>\"))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "d596e475",
   "metadata": {},
   "source": [
    "## Basic Example (using the Docker Container)\n",
    "\n",
    "You can also run the Chroma Server in a Docker container separately, create a Client to connect to it, and then pass that to LlamaIndex. \n",
    "\n",
    "Here is how to clone, build, and run the Docker Image:\n",
    "```\n",
    "git clone git@github.com:chroma-core/chroma.git\n",
    "docker-compose up -d --build\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "id": "d6c9bd64",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO:chromadb.telemetry.posthog:Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.\n",
      "Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.\n",
      "Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.\n",
      "INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens\n",
      "> [build_index_from_nodes] Total LLM token usage: 0 tokens\n",
      "> [build_index_from_nodes] Total LLM token usage: 0 tokens\n",
      "INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 17038 tokens\n",
      "> [build_index_from_nodes] Total embedding token usage: 17038 tokens\n",
      "> [build_index_from_nodes] Total embedding token usage: 17038 tokens\n"
     ]
    }
   ],
   "source": [
    "# create the chroma client and add our data\n",
    "import chromadb\n",
    "from chromadb.config import Settings\n",
    "\n",
    "remote_db = chromadb.Client(\n",
    "    Settings(\n",
    "        chroma_api_impl=\"rest\",\n",
    "        chroma_server_host=\"localhost\",\n",
    "        chroma_server_http_port=\"8000\",\n",
    "    )\n",
    ")\n",
    "remote_db.reset()  # resets the database\n",
    "chroma_collection = remote_db.get_or_create_collection(\"quickstart\")\n",
    "vector_store = ChromaVectorStore(chroma_collection=chroma_collection)\n",
    "storage_context = StorageContext.from_defaults(vector_store=vector_store)\n",
    "index = VectorStoreIndex.from_documents(\n",
    "    documents, storage_context=storage_context, embed_model=embed_model\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "id": "88e10c26",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO:llama_index.token_counter.token_counter:> [retrieve] Total LLM token usage: 0 tokens\n",
      "> [retrieve] Total LLM token usage: 0 tokens\n",
      "> [retrieve] Total LLM token usage: 0 tokens\n",
      "INFO:llama_index.token_counter.token_counter:> [retrieve] Total embedding token usage: 8 tokens\n",
      "> [retrieve] Total embedding token usage: 8 tokens\n",
      "> [retrieve] Total embedding token usage: 8 tokens\n",
      "INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 1874 tokens\n",
      "> [get_response] Total LLM token usage: 1874 tokens\n",
      "> [get_response] Total LLM token usage: 1874 tokens\n",
      "INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens\n",
      "> [get_response] Total embedding token usage: 0 tokens\n",
      "> [get_response] Total embedding token usage: 0 tokens\n"
     ]
    },
    {
     "data": {
      "text/markdown": [
       "<b>\n",
       "The author grew up writing essays, learning Italian, exploring Florence, painting people, working with computers, studying at RISD, living in a rent-controlled apartment, building an online store builder, editing code, publishing essays online, writing essays, working on spam filters, cooking for groups, buying a building, and attending parties.</b>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Query Data from the Chroma Docker index\n",
    "query_engine = index.as_query_engine(chroma_collection=chroma_collection)\n",
    "response = query_engine.query(\"What did the author do growing up?\")\n",
    "display(Markdown(f\"<b>{response}</b>\"))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "0a0e79f7",
   "metadata": {},
   "source": [
    "## Update and Delete\n",
    "\n",
    "While building toward a real application, you want to go beyond adding data, and also update and delete data. \n",
    "\n",
    "Chroma has users provide `ids` to simplify the bookkeeping here. `ids` can be the name of the file, or a combined has like `filename_paragraphNumber`, etc.\n",
    "\n",
    "Here is a basic example showing how to do various operations:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "id": "d9411826",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'node_info': '{\"start\": 0, \"end\": 4040, \"_node_type\": \"1\"}', 'relationships': '{\"1\": \"a0294b91-ff5f-45fe-b249-5596a18cc952\", \"3\": \"95771df1-9ec9-4128-9a11-ac92b768e2e3\"}', 'document_id': 'a0294b91-ff5f-45fe-b249-5596a18cc952', 'doc_id': 'a0294b91-ff5f-45fe-b249-5596a18cc952', 'ref_doc_id': 'a0294b91-ff5f-45fe-b249-5596a18cc952', 'author': 'Paul Graham'}\n",
      "count before 20\n",
      "count after 19\n"
     ]
    }
   ],
   "source": [
    "doc_to_update = chroma_collection.get(limit=1)\n",
    "doc_to_update[\"metadatas\"][0] = {\n",
    "    **doc_to_update[\"metadatas\"][0],\n",
    "    **{\"author\": \"Paul Graham\"},\n",
    "}\n",
    "chroma_collection.update(\n",
    "    ids=[doc_to_update[\"ids\"][0]], metadatas=[doc_to_update[\"metadatas\"][0]]\n",
    ")\n",
    "updated_doc = chroma_collection.get(limit=1)\n",
    "print(updated_doc[\"metadatas\"][0])\n",
    "\n",
    "# delete the last document\n",
    "print(\"count before\", chroma_collection.count())\n",
    "chroma_collection.delete(ids=[doc_to_update[\"ids\"][0]])\n",
    "print(\"count after\", chroma_collection.count())"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.10"
  },
  "vscode": {
   "interpreter": {
    "hash": "0ac390d292208ca2380c85f5bce7ded36a7a25670a97c40b8009630eb36cb06e"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
